Abstract

In budget-limited multi-armed bandit (MAB) problems, the learner's actions
are costly and constrained by a fixed budget. Consequently, an optimal
exploitation policy may not be to pull the optimal arm repeatedly, as is the
case in other variants of MAB, but rather to pull the sequence of different
arms that maximises the agent's total reward within the budget. This
difference from existing MABs means that new approaches to maximising the
total reward are required. Given this, we develop two pulling policies,
namely: (i) KUBE; and (ii) fractional KUBE. Whereas the former provides
better performance, by up to 40% in our experimental settings, the latter is
computationally less expensive. We also prove logarithmic upper bounds for
the regret of both policies, and show that these bounds are asymptotically
optimal (i.e. they only differ from the best possible regret by a constant
factor).

1 Introduction



The standard multi-armed bandit (MAB) problem was originally proposed by
Robbins (1952), and presents one of the clearest examples of the trade-off
between exploration and exploitation in reinforcement learning. In the standard
MAB problem, there are K arms of a single machine, each of which delivers 
rewards that are independently drawn from an unknown distribution when an 
arm of the machine is pulled. Given this, an agent must choose which of these
arms to pull. At each time step, it pulls one of the machine's arms and
receives a reward or payoff. The agent's goal is to maximise its return; that is,
the expected sum of the rewards it receives over a sequence of pulls. As the
reward distributions differ from arm to arm, the goal is to find the arm with 
the highest expected payoff as early as possible, and then to keep playing using 
that best arm. However, the agent does not know the rewards for the arms, 
so it must sample them in order to learn which is the optimal one. In other 
words, in order to choose the optimal arm (exploitation) the agent first has to 
estimate the mean rewards of all of the arms (exploration). In the standard 
MAB, this trade-off has been effectively balanced by decision-making policies
such as upper confidence bound (UCB) and $\epsilon_n$-greedy (Auer et al., 2002).

However, this MAB model gives an incomplete description of the sequential
decision-making problem facing an agent in many real-world scenarios. To this
end, a variety of other related models have been studied recently, and, in
particular, a number of researchers have focused on MABs with budget constraints,
where arm-pulling is costly and is limited by a fixed budget (Antos et al., 2008;
Bubeck et al., 2009; Guha and Munagala, 2007). In these models, the agent's
exploration budget limits the number of times it can sample the arms in order
to estimate their rewards, which defines an initial exploration phase. In the 
subsequent cost-free exploitation phase, an agent's policy is then simply to pull 
the arm with the highest expected reward. However, in many settings, it is not
only the exploration phase but also the exploitation phase that is limited by a
cost budget. To address this limitation, a new bandit model, the budget-limited
MAB, was introduced (Tran-Thanh et al., 2010). In this model, pulling an arm
is again costly, but crucially both the exploration and exploitation phases are 
limited by a single budget. This type of limitation is well motivated by several 
real-world applications. For example, in many wireless sensor network appli- 
cations, a sensor node's actions, such as sampling or data forwarding, consume 
energy, and therefore the number of actions is limited by the capacity of the
sensor's batteries (Padhy et al., 2010). Furthermore, many of these scenarios
require that sensors learn the optimal sequence of actions that can be
performed, with the goal of maximising the long-term value of the actions they
take (Tran-Thanh et al., 2011). In such settings, each action can be considered
as an arm, with a cost equal to the amount of energy needed to perform that
task. Now, because the battery is limited, both the exploration (i.e. learning
the rewards of tasks) and exploitation (i.e. taking the optimal actions given
reward estimates) phases are budget limited.



Against this background, Tran-Thanh et al. (2010) showed that the
budget-limited MAB cannot be derived from any other existing MAB model, and
therefore, previous MAB learning methods are not suitable to efficiently deal with
this problem. Thus, they proposed a simple budget-limited ε-first approach for
the budget-limited MAB. This splits the overall budget B into two portions,
the first εB of which is used for exploration, and the remaining (1 - ε)B for
exploitation. However, this budget-limited ε-first method suffers from a
number of drawbacks. First, the performance of ε-first approaches depends on the
value of ε chosen. In particular, high values guarantee accurate exploration but
inefficient exploitation, and vice versa. Given this, finding a suitable ε for a
particular problem instance is a challenge, since settings with different budget
limits or arm costs (which are not known beforehand) will typically require
different values of ε. In addition, even with a good ε value, the method typically
provides poor efficiency in terms of minimising its performance regret (defined
as the difference between its performance and that of the optimal policy), which
is a standard measure of performance. In particular, the regret bound that
ε-first provides is $O(B^{2/3})$, where B is the budget limit, whereas the
theoretical best possible regret bound is typically a logarithmic function of the
number of pulls^1 (Lai and Robbins, 1985).



1 Note that in the budget-limited MAB, the budget B determines the number of pulls.
Thus, a logarithmic function of the number of pulls is also a logarithmic function of the budget.

To address this shortcoming, in this paper we propose two new learning
algorithms, called KUBE (for knapsack-based upper confidence bound
exploration and exploitation) and fractional KUBE, that do not explicitly separate
exploration from exploitation. Instead, they explore and exploit at the same 
time by adaptively choosing which arm to pull next, based on the current esti- 
mates of the arms' rewards. In more detail, at each time step, KUBE calculates 
the best set of arms that provides the highest total upper confidence bound of 
the estimated expected reward, and still fits into the residual budget, using an
unbounded knapsack model to determine this best set (Kellerer et al., 2004).
However, since unbounded knapsack problems are known to be NP-hard, the
algorithm uses an efficient approximation method taken from the knapsack
literature, called the density-ordered greedy approach, in order to estimate the
best set (Kohli et al., 2004). Following this, KUBE then uses the frequency that each
arm occurs within this approximated best set as a probability with which to ran- 
domly choose an arm to pull in the next time step. The reward that is received 
is then used to update the estimate of the upper confidence bound of the pulled 
arm's expected reward, and the unbounded knapsack problem is solved again. 
The intuition behind this algorithm is that if we know the real value of the arms, 
then the budget-limited MAB can be reduced to an unbounded knapsack prob- 
lem, where the optimal solution is to subsequently pull from the set of arms that 
forms the solution of the knapsack problem. Given this, by randomly choosing 
the next arm from the current best set at each time step, the agent generates an 
accurate estimate of the true optimal solution (i.e. real best set of arms), and, 
accordingly, the sequence of pulled arms will converge to this optimal set. In a 
similar vein, fractional KUBE also estimates the best set of arms that provides 
the highest total upper confidence bound of the estimated expected reward at 
each time step, and uses the frequency that each arm occurs within this approx- 
imated best set as a probability to randomly pull the arms. However, instead of 
using the density-ordered greedy to solve the underlying unbounded knapsack 
problem, fractional KUBE relies on a computationally less expensive approach,
namely the fractional relaxation based algorithm (Kellerer et al., 2004). Given
this, fractional KUBE requires less computation than KUBE.

To analyse the performance of KUBE and its fractional counterpart in terms
of minimising the regret, we devise provably asymptotically optimal upper
bounds on their performance regret. That is, our proposed upper bounds
differ from the best possible one by only a constant factor. Following this,
we numerically evaluate the performance of the proposed algorithms against a
state-of-the-art method, namely the budget-limited ε-first approach, in order
to demonstrate that our algorithms are the first that can achieve this optimal
bound. In addition, we show that KUBE typically outperforms its fractional
counterpart by up to 40%; however, this comes at an increased computational
cost per time step (O(K ln K) for KUBE, compared with O(K) for fractional
KUBE). Given this, the main contributions of this paper are:

• We introduce KUBE and fractional KUBE, the first budget-limited MAB
learning algorithms that provably achieve an O(ln B) theoretical upper
bound on the regret, where B is the budget limit.

• We demonstrate that with an increased computational cost, KUBE
outperforms fractional KUBE in the experiments. We also show that while
both algorithms achieve logarithmic regret bounds, the budget-limited
ε-first approaches fail to do so.

The paper is organised as follows: Next we describe the budget-limited MAB.
We then introduce our two learning algorithms in Section 3. In Section 4 we
provide regret bounds on the performance of the proposed algorithms. Following
this, Section 5 presents an empirical comparison of KUBE and its fractional
counterpart with the ε-first approach. Section 6 concludes.

2 Model Description 

The budget-limited MAB model consists of a machine with K arms, one of 
which must be pulled by the agent at each time step. By pulling arm i, the 
agent has to pay a pulling cost, denoted with $c_i$, and receives a non-negative
reward drawn from a distribution associated with that specific arm. The agent 
has a cost budget B, which it cannot exceed during its operation time (i.e. the 
total cost of pulling arms cannot exceed this budget limit). Now, since reward 
values are typically bounded in real-world applications, we assume that each 
arm's reward distribution has bounded support. Let $\mu_i$ denote the mean value
of the rewards that the agent receives from pulling arm i. Within our model, 
the agent's goal is to maximise the sum of rewards it earns from pulling the 
arms of the machine, with respect to the budget B. However, the agent has no 
initial knowledge of the $\mu_i$ of each arm i, so it must learn these values in order
to deduce a policy that maximises its sum of rewards. Given this, our objective 
is to find the optimal pulling algorithm, which maximises the expectation of the 
total reward that the agent can achieve, without exceeding the cost budget B. 

Formally, let A be an arm-pulling algorithm, giving a finite sequence of pulls.
Let $N_i^A(B)$ be the random variable that represents the number of pulls of arm
i by A, with respect to the budget limit B. Since the total cost of the sequence
A cannot exceed B, we have:

$$ P\left( \sum_{i=1}^{K} N_i^A(B)\, c_i \le B \right) = 1. \tag{1} $$

Let G(A) be the total reward earned by using A to pull the arms. The
expectation of G(A) is:

$$ E[G(A)] = \sum_{i=1}^{K} E\left[N_i^A(B)\right] \mu_i. \tag{2} $$

Then, let A* denote an optimal solution that maximises the expected total
reward, that is:

$$ A^* = \arg\max_{A} \sum_{i=1}^{K} E\left[N_i^A(B)\right] \mu_i. \tag{3} $$

Note that in order to determine A*, we have to know the value of $\mu_i$ in advance,
which does not hold in our case. Thus, A* represents a theoretical optimum
value, which is unachievable in general.

Nevertheless, for any algorithm A, we can define the regret for A as the
difference between the expected cumulative reward for A and that of the
theoretical optimum A*. More precisely, letting R(A) denote the regret, we have:

$$ R(A) = E[G(A^*)] - E[G(A)]. \tag{4} $$

Given this, our objective is to derive a method of generating a sequence of arm
pulls that minimises this regret for the class of MAB problems defined above.

3 The Algorithms

Given the model described in the previous section, we now introduce two
learning methods, KUBE and fractional KUBE, that efficiently deal with the
challenges discussed in Section 1. Recall that at each time step of the algorithms,
we determine the optimal set of arms that provides the best total estimated 
expected reward. Due to the similarities of our MAB to unbounded knap- 
sack problems when the rewards are known, we use techniques taken from the 
unbounded knapsack domain. Thus, in this section, we first introduce the un- 
bounded knapsack problem, and then show how to use knapsack methods in 
our algorithms. 

3.1 The Unbounded Knapsack Problem 

The unbounded knapsack problem is formulated as follows. A knapsack of 
weight capacity C is to be filled with some set of K different types of items. Each
item type $i \in K$ has a corresponding value $v_i$ and weight $w_i$, and the problem
is to select a set that maximises the total value of items in the knapsack, such
that their total weight does not exceed the knapsack capacity C. That is, the
goal is to find the non-negative integers $\{x_i\}_{i=1}^{K}$ that maximise:

$$ \max \sum_{i=1}^{K} x_i v_i \quad \text{s.t.} \quad \sum_{i=1}^{K} x_i w_i \le C, \quad \forall i \in \{1, \ldots, K\}: x_i \text{ integer}. \tag{5} $$

Note that this problem is a generalisation of the standard knapsack problem, in
which $x_i \in \{0, 1\}$; that is, each item type contains only one item, and we can
either choose it or not. The unbounded knapsack problem is NP-hard.
However, near-optimal approximation methods have been proposed to solve it (a
detailed survey can be found in (Kellerer et al., 2004)). Among these
approximation methods, a simple but efficient approach is the density-ordered greedy
algorithm, and here we make use of this method. In more detail, the
density-ordered greedy algorithm has O(K log K) computational complexity, where K
is the number of item types (Kohli et al., 2004). This algorithm works as
follows. Let $v_i/w_i$ denote the density of type i. To begin, the item types are sorted
in order of their density, which is an operation of O (K log K) computational 
complexity. Next, in the first round of this algorithm, as many units of the 
highest density item are selected as is feasible without exceeding the knapsack 
capacity. Then, in the second round, the densest item of the remaining feasible 
items is identified, and as many units of it as possible are selected. This step is 
repeated until there are no feasible items left (i.e. at most K rounds). 
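
The density-ordered greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the function name `density_ordered_greedy` and its argument names are assumptions introduced here, and weights are assumed to be strictly positive.

```python
from typing import List, Sequence


def density_ordered_greedy(values: Sequence[float],
                           weights: Sequence[float],
                           capacity: float) -> List[int]:
    """Return the item counts x_i picked by the density-ordered greedy heuristic."""
    # Sort item types by density v_i / w_i, highest first: O(K log K).
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i],
                   reverse=True)
    counts = [0] * len(values)
    remaining = capacity
    for i in order:
        # Take as many units of the densest remaining type as still fit;
        # a type that no longer fits simply receives zero units.
        units = int(remaining // weights[i])
        counts[i] = units
        remaining -= units * weights[i]
    return counts
```

For example, with values (10, 7, 1), weights (4, 3, 1) and capacity 10, the sketch takes two units of the first type and then fills the remaining capacity with two units of the third type. The result is an approximation: one unit of the first type plus two of the second would give a higher total value.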

Another way to approximate the optimal solution of the unbounded knap- 
sack problem is the fractional relaxation based algorithm. This relaxes the 
original problem to its fractional version. In particular, within the fractional 
unbounded knapsack problem we allow $x_i$ to be fractional. Now, it is easy to
show that the optimal solution of the fractional unbounded knapsack is to solely
choose $I^* = \arg\max_i v_i/w_i$ (i.e. I* is the item type with the highest density)
(Kellerer et al., 2004). That is, if $x^* = (x^*_1, \ldots, x^*_K)$ denotes the optimal
solution of the fractional unbounded knapsack, then $x^*_{I^*} = C/w_{I^*}$, while
$\forall j \ne I^*$, $x^*_j = 0$. Given this, within the original unbounded knapsack
problem (where $x_i$ are integers), the fractional relaxation based algorithm chooses
$x_{I^*} = \lfloor C/w_{I^*} \rfloor$, and $x_j = 0$, $\forall j \ne I^*$. It can easily be
shown that the complexity of this algorithm is O(K), which is the cost of
determining the highest density type.
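
As a small illustration of the fractional relaxation based algorithm, the following Python sketch finds the densest item type in a single pass and rounds its fractional count down. The function name `fractional_relaxation` is an assumption made for this example, and weights are assumed to be strictly positive.

```python
import math
from typing import List, Sequence


def fractional_relaxation(values: Sequence[float],
                          weights: Sequence[float],
                          capacity: float) -> List[int]:
    """Return item counts that use only the single densest item type."""
    # One pass over the K item types to find the highest density: O(K).
    best = max(range(len(values)), key=lambda i: values[i] / weights[i])
    counts = [0] * len(values)
    # Round the fractional optimum C / w_{I*} down to an integer count.
    counts[best] = math.floor(capacity / weights[best])
    return counts
```

With values (10, 7, 1), weights (4, 3, 1) and capacity 10, only the first type is used (two units), so the leftover capacity is not filled by other types.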



3.2 KUBE 

The KUBE algorithm is depicted in Algorithm 1. Here, let t denote the time
step, and $B_t$ denote the residual budget at time $t \ge 1$. Note that
at the start (i.e. t = 1), $B_1 = B$, where B is the total budget limit. At each
subsequent time step, t, KUBE first checks that arm pulling is feasible. That
is, it is feasible only if at least one of the arms can be pulled with the remaining
budget. Specifically, if $B_t < \min_i c_i$ (i.e. the residual budget is smaller than
the lowest pulling cost), then KUBE stops (steps 3-4).

If arm pulling is still feasible, KUBE first pulls each arm once in the initial
phase (steps 6-7). Following this, at each time step t > K, it estimates the best
set of arms according to their upper confidence bound using the density-ordered
greedy approximation method applied to the following problem:

$$ \max \sum_{i=1}^{K} m_{i,t} \left( \hat{\mu}_{i,n_{i,t}} + \sqrt{\frac{2 \ln t}{n_{i,t}}} \right) \quad \text{s.t.} \quad \sum_{i=1}^{K} m_{i,t}\, c_i \le B_t, \quad \forall i, t: m_{i,t} \text{ integer}. \tag{6} $$



In the above expression, $\hat{\mu}_{i,n_{i,t}}$ is the current estimate of arm i's expected reward
(calculated as the average reward received so far from pulling arm i), $n_{i,t}$ is
the number of pulls of arm i until time step t, and $\sqrt{2 \ln t / n_{i,t}}$ is the size of the
upper confidence interval. The goal, then, is to find integers $\{m_{i,t}\}_{i \in K}$ such
that Equation 6 is maximised, with respect to the residual budget limit $B_t$
(n.b. from here on, we drop the subscript $i \in K$ on this set). Since this problem
is NP-hard, we use the density-ordered greedy method to find a near-optimal
set of arms (step 9). Note that the upper confidence bound on arm i's density
is:

$$ \frac{\hat{\mu}_{i,n_{i,t}}}{c_i} + \frac{\sqrt{2 \ln t / n_{i,t}}}{c_i}. \tag{7} $$

Let $M^*(B_t) = \{m^*_{i,t}\}$ be this method's solution to the problem in Equation 6,
giving us the desired set of arms, where $m^*_{i,t}$ indicates how many times arm i
is taken into account within the set. Using $\{m^*_{i,t}\}$, KUBE
randomly chooses the next arm to pull, i(t), by selecting arm i with probability
(step 10):

$$ P(i(t) = i) = \frac{m^*_{i,t}}{\sum_{k=1}^{K} m^*_{k,t}}. \tag{8} $$

After the pull, it then updates the estimated upper bound of the chosen arm,
and the residual budget limit $B_t$ (steps 12-13).
Algorithm 1 The KUBE Algorithm

 1: t = 1; B_t = B; γ > 0;
 2: while pulling is feasible do
 3:   if B_t < min_i c_i then
 4:     STOP! {pulling is not feasible}
 5:   end if
 6:   if t ≤ K then
 7:     Initial phase: play arm i(t) = t;
 8:   else
 9:     use density-ordered greedy to calculate M*(B_t) = {m*_{i,t}}, the solution
        of Equation 6;
10:     randomly pull i(t) with P(i(t) = i) = m*_{i,t} / Σ_{k=1}^{K} m*_{k,t};
11:   end if
12:   update the estimated upper bound of arm i(t);
13:   B_{t+1} = B_t − c_{i(t)}; t = t + 1;
14: end while
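
The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `kube`, `greedy_counts` and the reward callback `pull` are hypothetical, arms are indexed from 0, and the sketch additionally assumes that the budget is large enough to pull every arm once in the initial phase (a case the algorithm listing does not treat separately).

```python
import math
import random
from typing import Callable, List, Sequence


def greedy_counts(densities: Sequence[float], costs: Sequence[float],
                  budget: float) -> List[int]:
    """Density-ordered greedy: fill the budget with the densest arms first."""
    order = sorted(range(len(costs)), key=lambda i: densities[i], reverse=True)
    counts = [0] * len(costs)
    for i in order:
        units = int(budget // costs[i])
        counts[i] = units
        budget -= units * costs[i]
    return counts


def kube(costs: Sequence[float], pull: Callable[[int], float],
         budget: float, rng: random.Random) -> List[int]:
    """Run a KUBE-style loop and return the sequence of pulled arm indices."""
    k = len(costs)
    if sum(costs) > budget:
        # Assumption of this sketch: the initial phase must be affordable.
        raise ValueError("budget too small to pull every arm once")
    pulls = [0] * k      # n_{i,t}: number of pulls of arm i so far
    means = [0.0] * k    # running average reward of arm i
    history: List[int] = []
    t = 1
    while budget >= min(costs):                     # steps 2-5
        if t <= k:                                  # steps 6-7 (0-based index)
            arm = t - 1
        else:
            # Upper confidence bound on each arm's density (Equation 7).
            ucb_density = [
                (means[i] + math.sqrt(2 * math.log(t) / pulls[i])) / costs[i]
                for i in range(k)
            ]
            m = greedy_counts(ucb_density, costs, budget)   # step 9
            arm = rng.choices(range(k), weights=m)[0]       # step 10
        reward = pull(arm)
        pulls[arm] += 1                                     # step 12
        means[arm] += (reward - means[arm]) / pulls[arm]
        budget -= costs[arm]                                # step 13
        history.append(arm)
        t += 1
    return history
```

Because an arm receives a positive count in the greedy solution only if it fits into the residual budget, every arm drawn in step 10 is affordable, so the loop never overspends.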



The intuition behind KUBE is the following. By repeatedly drawing the next 
arm to pull from a distribution formed by the current estimated approximate 
best set, the expected reward of KUBE equals the average reward for following 
the optimal solution to the corresponding unbounded knapsack problem, given 
the current reward estimates. If the true values of the arms were known, then 
this would imply that the average performance of KUBE efficiently converges 
to the optimal solution of the unbounded knapsack problem reduced from the 
budget-limited MAB model. It is easy to show that the optimal solution of 
this knapsack model forms the theoretical optimal policy of the budget-limited 
MAB in case of having full information. Put differently, if the mean reward value 
of each arm is known, then the budget-limited problem can be reduced to the 
unbounded knapsack problem, and thus, the optimal solution of the knapsack 
problem is the optimal solution of the budget-limited MAB as well. In addition, 
by combining the upper confidence bound with the estimated mean values of 
the arms, we guarantee that an arm that is not yet sampled many times may 
be pulled more frequently, since its upper confidence interval is large. Thus, we
explore and exploit at the same time (for more details, see (Audibert et al., 2009;
Auer et al., 2002)). Note that, by using the density-ordered greedy method,
KUBE achieves a O(K ln K) computational cost per time step.

3.3 Fractional KUBE

We now turn to the fractional version of KUBE, which follows the underlying 
concept of KUBE. It also approximates the underlying unbounded knapsack 
problem at each time step t in order to determine the frequency of arms within 
the estimated best set of arms. However, it differs from KUBE by using the 
fractional relaxation based method to approximate the unbounded knapsack in 
Step 9 of Algorithm 1. Crucially, fractional KUBE uses the fractional relaxation
based algorithm to solve the following fractional unbounded knapsack problem
at each t:

$$ \max \sum_{i=1}^{K} m_{i,t} \left( \hat{\mu}_{i,n_{i,t}} + \sqrt{\frac{2 \ln t}{n_{i,t}}} \right) \quad \text{s.t.} \quad \sum_{i=1}^{K} m_{i,t}\, c_i \le B_t. \tag{9} $$



Recall that within KUBE, the frequency of arms within the approximated solu- 
tion of the unbounded knapsack forms a probability distribution from which the 
agent randomly pulls the next arm. Now, since the fractional relaxation based 
algorithm solely chooses the arm (i.e. item type) with the highest estimated 
confidence bound-cost ratio (i.e. item density), fractional KUBE does not need 
to randomly choose an arm. Instead, at each time step t, it pulls the arm that
maximises $\hat{\mu}_{i,n_{i,t}}/c_i + \sqrt{2 \ln t / n_{i,t}}/c_i$. That is, fractional KUBE can also be seen as
the budget-limited version of UCB (see (Auer et al., 2002) for more details of
UCB).

Computation-wise, by replacing the density-ordered greedy with the frac- 
tional relaxation based algorithm, fractional KUBE decreases the computational 
cost to O (K) per time step. In what follows, we show that both KUBE and its 
fractional counterpart achieve asymptotically optimal regret bounds. 
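
A compact Python sketch of fractional KUBE is given below for illustration. The function name `fractional_kube` and the reward callback `pull` are hypothetical, arms are indexed from 0, and two choices are assumptions of this sketch rather than statements from the text: the budget must cover one pull of every arm in the initial phase, and the argmax is restricted to arms that still fit into the residual budget.

```python
import math
from typing import Callable, List, Sequence


def fractional_kube(costs: Sequence[float], pull: Callable[[int], float],
                    budget: float) -> List[int]:
    """Pull the arm with the highest UCB density until the budget runs out."""
    k = len(costs)
    if sum(costs) > budget:
        # Assumption of this sketch: the initial phase must be affordable.
        raise ValueError("budget too small to pull every arm once")
    pulls = [0] * k      # n_{i,t}: number of pulls of arm i so far
    means = [0.0] * k    # running average reward of arm i
    history: List[int] = []
    t = 1
    while budget >= min(costs):
        if t <= k:
            arm = t - 1                      # initial phase, 0-based index
        else:
            # Assumption: only arms that still fit the budget are candidates.
            affordable = [i for i in range(k) if costs[i] <= budget]
            arm = max(
                affordable,
                key=lambda i: (means[i]
                               + math.sqrt(2 * math.log(t) / pulls[i])) / costs[i],
            )
        reward = pull(arm)
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]
        budget -= costs[arm]
        history.append(arm)
        t += 1
    return history
```

Each step needs only one pass over the K arms, which reflects the O(K) per-step cost stated above; no sorting and no random draw are required.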



4 Performance Analysis 

We now focus on the analysis of the expected regret of KUBE and fractional 
KUBE, defined by Equation 4. To this end, in this section we: (i) derive an
upper bound on the regret of the algorithms, and (ii) show that these bounds 
are asymptotically optimal. 

To begin, let us state some simplifying assumptions and define some useful 
terms. Without loss of generality, for ease of exposition we assume that the 
reward distribution of each arm has support in [0, 1], and that the pulling cost
$c_i \ge 1$ for each i (our result can be scaled for different size supports and costs
as appropriate). Let $I^* = \arg\max_i \mu_i / c_i$ be the arm with the highest true mean
value density. For the sake of simplicity, we assume that I* is unique (however,
our proofs do not exploit this fact at all). Let $d_{\min} = \min_{j \ne I^*} \{ \mu_{I^*}/c_{I^*} - \mu_j/c_j \}$
denote the minimal true mean value density difference of arm I* and that of
any other arm j. In addition, let $c_{\min} = \min_j c_j$ and $c_{\max} = \max_j c_j$ denote the
smallest and largest pulling costs, respectively. Then let $\delta_j = c_j - c_{I^*}$ be the
difference between arm j's pulling cost and that of arm I*. Similarly, let
$\Delta_j = \mu_{I^*} - \mu_j$ denote the difference between the true mean value of arm I*
and that of arm j. Note that both $\delta_j$ and $\Delta_j$ could be negative values, since I* does not
necessarily have the highest true mean value, nor the smallest pulling cost. In 
addition, let T denote the finite-time operating time of the agent. 

Now, we first analyse the performance of KUBE. In what follows, we first
estimate the number of times we pull arm $j \ne I^*$, instead of I*. Based on this
result, we estimate E[T], the average number of pulls of KUBE. This bound
guarantees that KUBE always pulls "enough" arms so that the difference of the
number of pulls in the theoretical optimal solution and that of KUBE is small,
compared to the size of the budget. By using the estimated value of E[T], we
then show that KUBE achieves an O(ln(B)) worst case regret on average. In
more detail, we get:

Theorem 1 (Main result 1) For any budget size B > 0, the performance
regret of KUBE is at most:




It is easy to show that for each $j \ne I^*$, at least one of $\delta_j$ and $\Delta_j$ has
to be positive. This implies that $\sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \delta_j > 0$. That is, the
performance regret of KUBE (i.e. R(KUBE)) is upper-bounded by O(ln B).
To prove this theorem, we will make use of the following version of the
Chernoff-Hoeffding concentration inequality for bounded random variables:



Theorem 2 (Chernoff-Hoeffding inequality (Hoeffding, 1963)) Let $X_1, X_2, \ldots, X_n$
denote the sequence of random variables with common range [0, 1],
such that for any $1 \le t \le n$, we have $E[X_t | X_1, \ldots, X_{t-1}] = \mu$. Let
$S_n = \frac{1}{n} \sum_{t=1}^{n} X_t$. Given this, for any $\delta \ge 0$, we have:

$$ P(S_n \ge \mu + \delta) \le e^{-2n\delta^2}, $$
$$ P(S_n \le \mu - \delta) \le e^{-2n\delta^2}. $$

The proof can be found, for example, in (Hoeffding, 1963).
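
As a quick numerical sanity check of the upper-tail inequality, the following Python sketch compares the bound $e^{-2n\delta^2}$ with an empirical tail frequency. The helper names and the choice of independent Uniform[0, 1] samples (which have mean 0.5 and satisfy the conditions of the theorem) are assumptions made for this example only.

```python
import math
import random


def hoeffding_bound(n: int, delta: float) -> float:
    """Right-hand side of the inequality: exp(-2 n delta^2)."""
    return math.exp(-2 * n * delta ** 2)


def empirical_upper_tail(n: int, delta: float, trials: int,
                         rng: random.Random) -> float:
    """Estimate P(S_n >= mu + delta) for i.i.d. Uniform[0, 1] samples."""
    mu = 0.5  # mean of Uniform[0, 1]
    hits = 0
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n)) / n
        if s_n >= mu + delta:
            hits += 1
    return hits / trials
```

For instance, with n = 50 and δ = 0.1 the bound equals $e^{-1}$, and the empirical frequency returned by `empirical_upper_tail` should lie well below it, since the inequality holds for every distribution supported on [0, 1] and is therefore not tight for any particular one.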

We now focus on the performance analysis of KUBE. To this end, we in- 
troduce some further notation. Let T denote the number of pulls of KUBE. In 
addition, let Nj (T) denote the number of times KUBE pulls arm j up to time 
step T. 

In what follows, we first devise an upper bound for $N_j(T)$ for all $j \ne I^*$.
That is, we estimate the number of times we pull arm $j \ne I^*$, instead of I*.
Based on this result, we estimate the average number of pulls of KUBE (i.e.
E[T]). This bound guarantees that KUBE always pulls "enough" arms so that
the difference between the number of pulls in the theoretical optimal solution
and that of KUBE is small, compared to the size of the budget. By using the
estimated value of E[T], we then show that KUBE achieves an O(ln(B)) worst
case regret on average. We now state the following:

Lemma 3 Suppose that KUBE pulls the arms T times. If $j \ne I^*$, then:

$$ E[N_j(T) | T] \le \left( \frac{8}{d_{\min}^2} + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \right) \ln(T) + \frac{\pi^2}{3} + 1. $$

That is, the number of times KUBE pulls an arm $j \ne I^*$ is at most O(ln(T)).
To prove this lemma, let us first refresh some of the terms that are used: i(t)
is the arm pulled by KUBE at time t; when referring to a set of arms $\{m_{j,t}\}$,
$m_{j,t}$ is the number of pulls of arm j; $M^*(B_t) = \{m^*_{i,t}\}$ is the density-ordered
greedy approximate solution to the unbounded knapsack problem in Equation 6,
where $m^*_{i,t}$ is the number of arm i's pulls in this set; and $I^* = \arg\max_i \mu_i / c_i$
is the arm with the highest true mean value density. In addition,

$$ I(t) = \arg\max_j \left( \frac{\hat{\mu}_{j,n_{j,t}}}{c_j} + \frac{\sqrt{2 \ln t / n_{j,t}}}{c_j} \right) $$

is the arm with the highest estimated density confidence bound at time step t.
In order to prove Lemma 3, we rely on the following lemmas:
Lemma 4 Suppose that the total number of pulls KUBE makes of the arms
is T, and that at each time step t, the residual budget is $B_t$ (note that here
$B_1 = B$). For any $1 \le t \le T$, we have:

$$ \frac{c_{\min}}{B_t} \le \frac{1}{T - t + 1}. $$

Lemma 5 Suppose that the total number of pulls KUBE makes of the arms is
T. For any $1 \le t \le T$, we have:

$$ P(i(t) = j | T) \le P(I(t) = j | T) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{1}{T - t + 1}. $$



Proof of Lemma 4. At the beginning of time step t, the residual budget is $B_t$.
Since the total number of pulls is T, with respect to $B_t$, KUBE can still make
T - t + 1 pulls (including the pull at time step t). This indicates that:

$$ B_t \ge c_{i(t)} + c_{i(t+1)} + \cdots + c_{i(T)} \ge (T - t + 1)\, c_{\min}, $$

which directly implies the inequality in Lemma 4.

Proof of Lemma 5. We assume that the value of T is given. With a slight
abuse of notation, we drop the conditioning on T to simplify the proof
(i.e. all the probabilities are considered to be conditional on T), and we will
explicitly denote it when necessary. First, we consider a particular value of $B_t$.
Thus, we have:

$$ P(i(t) = j | B_t) = \sum_{\{m_{i,t}\}} P(i(t) = j | M^*(B_t) = \{m_{i,t}\})\, P(M^*(B_t) = \{m_{i,t}\}). \tag{10} $$

Recall that the density-ordered greedy approach first repeatedly adds arm I(t)
to set $\{m_{i,t}\}$ until it is not feasible. It is easy to show that after adding arm I(t)
as many times as possible (i.e. $m_{I(t),t}$ times) to the set, the residual budget is at
most $c_{I(t)}$ (or otherwise we could still add arm I(t) one more time). Therefore:

$$ \sum_{i \ne I(t)} m_{i,t} \le \frac{c_{I(t)}}{c_{\min}}. \tag{11} $$

That is, the total count of arm pulls other than I(t) in the set is at most $c_{I(t)}/c_{\min}$.
This inequality comes from the fact that we can construct a set with the greatest
number of arm pulls by only adding the arm with the smallest cost. Similarly,
we have:

$$ \sum_{k=1}^{K} m_{k,t} \ge \frac{B_t}{c_{\max}}, \tag{12} $$

because we can construct a set with the smallest number of arm pulls by only
adding the arm with the greatest cost. Combining Equations 11 and 12 gives:

$$ \frac{\sum_{i \ne I(t)} m_{i,t}}{\sum_{k=1}^{K} m_{k,t}} \le \frac{c_{I(t)}}{c_{\min}} \cdot \frac{c_{\max}}{B_t} \le \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{c_{\min}}{B_t}. \tag{13} $$
The last inequality is obtained from the fact that $c_{I(t)} \le c_{\max}$. Now, recall that
KUBE chooses arm j to pull with probability $m_{j,t} / \sum_{k=1}^{K} m_{k,t}$. This implies that:

$$ \begin{aligned}
P(i(t) = j | M^*(B_t) = \{m_{i,t}\}) &= P(i(t) = j, I(t) = j | M^*(B_t) = \{m_{i,t}\}) \\
&\quad + P(i(t) = j, I(t) \ne j | M^*(B_t) = \{m_{i,t}\}).
\end{aligned} $$

This can be upper bounded by:

$$ \begin{aligned}
P(i(t) = j | M^*(B_t) = \{m_{i,t}\}) &\le \frac{m_{I(t),t}}{\sum_{k=1}^{K} m_{k,t}}\, P(I(t) = j | M^*(B_t) = \{m_{i,t}\}) \\
&\quad + \frac{\sum_{i \ne I(t)} m_{i,t}}{\sum_{k=1}^{K} m_{k,t}}\, P(I(t) \ne j | M^*(B_t) = \{m_{i,t}\}).
\end{aligned} \tag{14} $$

The right hand side can be further upper bounded as follows:

$$ \begin{aligned}
P(i(t) = j | M^*(B_t) = \{m_{i,t}\}) &\le P(I(t) = j | M^*(B_t) = \{m_{i,t}\}) + \frac{\sum_{i \ne I(t)} m_{i,t}}{\sum_{k=1}^{K} m_{k,t}} \\
&\le P(I(t) = j | M^*(B_t) = \{m_{i,t}\}) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{c_{\min}}{B_t}.
\end{aligned} \tag{15} $$



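The last inequality is obtained from Equation 13. Substituting Equation 15 into
Equation 10 gives:

$$ \begin{aligned}
P(i(t) = j | B_t) &\le \sum_{\{m_{i,t}\}} \left( P(I(t) = j | M^*(B_t) = \{m_{i,t}\}) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{c_{\min}}{B_t} \right) P(M^*(B_t) = \{m_{i,t}\}) \\
&\le P(I(t) = j | B_t) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{c_{\min}}{B_t} \\
&\le P(I(t) = j | B_t) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{1}{T - t + 1}.
\end{aligned} \tag{16} $$

The last inequality is obtained from Lemma 4. Now we study the general case,
where $B_t$ is not fixed. By summing up Equation 16 over all possible values of
$B_t$, we have:

$$ \begin{aligned}
P(i(t) = j | T) &= \sum_{B_t} P(i(t) = j | T, B_t)\, P(B_t | T) \\
&\le \sum_{B_t} \left( P(I(t) = j | T, B_t) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{1}{T - t + 1} \right) P(B_t | T) \\
&\le P(I(t) = j | T) + \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{1}{T - t + 1},
\end{aligned} \tag{17} $$

which concludes the proof.

Based on Lemmas 4 and 5, Lemma 3 can be proved as follows: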
Proof of Lemma 3. We assume that the value of T is already given. Again, with
a slight abuse of notation, we drop the conditioning on T to simplify
the proof, and we will explicitly denote it when necessary. In this case, the proof
of the lemma for that particular value of T is along the same lines as that of
Theorem 1 of (Auer et al., 2002). In particular, recall that $E[N_j(T)]$ denotes the
expected number of times KUBE pulls an arm $j \ne I^*$ until time step T.
Given this, we have the following:

$$ \begin{aligned}
E[N_j(T)] &= 1 + \sum_{t=K+1}^{T} P(i(t) = j) \\
&\le 1 + \sum_{t=K+1}^{T} P(I(t) = j) + \sum_{t=K+1}^{T} \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{1}{T - t + 1} \\
&\le l + \sum_{t=K+1}^{T} P(I(t) = j, N_j(t) \ge l) + \sum_{t=K+1}^{T} \left( \frac{c_{\max}}{c_{\min}} \right)^2 \frac{1}{T - t + 1}
\end{aligned} \tag{18} $$



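for any $l \ge 1$. Now, let $b_{t,s} = \sqrt{2 \ln t / s}$. Considering the second term on the right
hand side of Equation 18, we have:

$$ \begin{aligned}
\sum_{t=K+1}^{T} P(I(t) = j, N_j(t) \ge l)
&= \sum_{t=K+1}^{T} P\left( \frac{\hat{\mu}_{I^*,N_{I^*}(t)}}{c_{I^*}} + \frac{b_{t,N_{I^*}(t)}}{c_{I^*}} \le \frac{\hat{\mu}_{j,N_j(t)}}{c_j} + \frac{b_{t,N_j(t)}}{c_j},\ N_j(t) \ge l \right) \\
&\le \sum_{t=K+1}^{T} P\left( \min_{0 < s < t} \left\{ \frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \right\} \le \max_{l \le s_j < t} \left\{ \frac{\hat{\mu}_{j,s_j}}{c_j} + \frac{b_{t,s_j}}{c_j} \right\} \right) \\
&\le \sum_{t=1}^{T} \sum_{s=1}^{t} \sum_{s_j=l}^{t} P\left( \frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \le \frac{\hat{\mu}_{j,s_j}}{c_j} + \frac{b_{t,s_j}}{c_j} \right).
\end{aligned} \tag{19} $$

If it is true that $\frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \le \frac{\hat{\mu}_{j,s_j}}{c_j} + \frac{b_{t,s_j}}{c_j}$, then at least one of the following
three statements must also hold:

$$ \frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \le \frac{\mu_{I^*}}{c_{I^*}}, \tag{20} $$

$$ \frac{\hat{\mu}_{j,s_j}}{c_j} \ge \frac{\mu_j}{c_j} + \frac{b_{t,s_j}}{c_j}, \tag{21} $$

$$ \frac{\mu_{I^*}}{c_{I^*}} < \frac{\mu_j}{c_j} + \frac{2 b_{t,s_j}}{c_j}. \tag{22} $$

That is, we get:

$$ \begin{aligned}
P\left( \frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \le \frac{\hat{\mu}_{j,s_j}}{c_j} + \frac{b_{t,s_j}}{c_j} \right)
&\le P\left( \frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \le \frac{\mu_{I^*}}{c_{I^*}} \right)
+ P\left( \frac{\hat{\mu}_{j,s_j}}{c_j} \ge \frac{\mu_j}{c_j} + \frac{b_{t,s_j}}{c_j} \right) \\
&\quad + P\left( \frac{\mu_{I^*}}{c_{I^*}} < \frac{\mu_j}{c_j} + \frac{2 b_{t,s_j}}{c_j} \right).
\end{aligned} \tag{23} $$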
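Applying the Chernoff-Hoeffding inequalities to the first two terms on the right
hand side of Equation 23 gives:

$$ P\left( \frac{\hat{\mu}_{I^*,s}}{c_{I^*}} + \frac{b_{t,s}}{c_{I^*}} \le \frac{\mu_{I^*}}{c_{I^*}} \right) = P\left( \hat{\mu}_{I^*,s} + b_{t,s} \le \mu_{I^*} \right) \le \exp\{-2 b_{t,s}^2 s\} = \exp\{-4 \ln t\} = t^{-4}, \tag{24} $$

$$ P\left( \frac{\hat{\mu}_{j,s_j}}{c_j} \ge \frac{\mu_j}{c_j} + \frac{b_{t,s_j}}{c_j} \right) = P\left( \mu_j + b_{t,s_j} \le \hat{\mu}_{j,s_j} \right) \le \exp\{-2 b_{t,s_j}^2 s_j\} = \exp\{-4 \ln t\} = t^{-4}. \tag{25} $$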

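On the other hand, for $l \ge \frac{8 \ln T}{d_{\min}^2}$, Equation 22 is false, since:

\[
\begin{aligned}
\frac{\mu_{I^*}}{c_{I^*}} - \frac{\mu_j}{c_j} - \frac{2 b_{t,s_j}}{c_j}
&\ge \frac{\mu_{I^*}}{c_{I^*}} - \frac{\mu_j}{c_j} - 2\sqrt{\frac{2 \ln t}{s_j}} \\
&\ge \frac{\mu_{I^*}}{c_{I^*}} - \frac{\mu_j}{c_j} - 2\sqrt{\frac{2 \ln T \, d_{\min}^2}{8 \ln T}} \\
&= \frac{\mu_{I^*}}{c_{I^*}} - \frac{\mu_j}{c_j} - d_{\min} \ge 0.
\end{aligned}
\tag{26}
\]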

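Here note that $c_j \ge 1$, $s_j \ge l \ge \frac{8 \ln T}{d_{\min}^2}$ and $t \le T$. If $l \ge \frac{8 \ln T}{d_{\min}^2}$, then
$P\left( \frac{\mu_{I^*}}{c_{I^*}} < \frac{\mu_j}{c_j} + \frac{2 b_{t,s_j}}{c_j} \right) = 0$. Substituting this and Equations 23, 24 and 25
into Equation 19 gives: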

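\[ \sum_{t=K+1}^{T} P(I(t)=j,\, N_j(t) \ge l) \le \sum_{t=1}^{T} \sum_{s=1}^{t} \sum_{s_j=l}^{t} 2 t^{-4} \le \frac{\pi^2}{3} \tag{27} \]

for any $l \ge \frac{8 \ln T}{d_{\min}^2}$. Note that the last inequality is obtained from the value of the Riemann
zeta function at 2 (i.e. $\sum_{t=1}^{\infty} t^{-2} = \frac{\pi^2}{6}$) (Ivić, 1985).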
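The zeta-function step can be checked numerically. The following is a minimal sketch, not part of the original proof: the function name `triple_sum` and the constant `LIMIT` are hypothetical, and the summation ranges ($t = 1..T$, $s = 1..t$, $s_j = l..t$) are assumed from the surrounding derivation. Each inner term is $2t^{-4}$ and there are at most $t^2$ of them per value of $t$, so the sum is at most $2\sum_t t^{-2} \le \pi^2/3$.

```python
import math


def triple_sum(T: int, l: int = 1) -> float:
    """Sum of 2 * t**-4 over t = 1..T, s = 1..t, s_j = l..t (assumed ranges)."""
    total = 0.0
    for t in range(1, T + 1):
        n_s = t                       # number of admissible values of s
        n_sj = max(0, t - l + 1)      # number of admissible values of s_j
        total += n_s * n_sj * 2.0 * t ** -4
    return total


# Twice the value of the Riemann zeta function at 2.
LIMIT = math.pi ** 2 / 3

if __name__ == "__main__":
    for T in (10, 100, 1000):
        print(T, triple_sum(T), LIMIT)
```

Increasing `l` only removes terms, so the bound holds for every $l \ge 1$.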

Now, consider the third term on the right hand side of Equation 18. By
using Lemma 4 we get:

\[ \sum_{t=K+1}^{T} \left(\frac{c_{\max}}{c_{\min}}\right)^2 \frac{1}{T-t+1} \le \left(\frac{c_{\max}}{c_{\min}}\right)^2 \ln(T). \tag{28} \]

We now combine Equations 27 and 28 together, and we set $l = \frac{8 \ln T}{d_{\min}^2} + 1$, which
gives:

\[ E[N_j(T)] \le \frac{8 \ln T}{d_{\min}^2} + 1 + \frac{\pi^2}{3} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \ln(T) \]

for any given value of $T$, which concludes the proof. From Lemma 3, we can
show the following:
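Lemma 6 Suppose that the total budget size is $B$. If $T$ denotes the total number
of pulls of KUBE then we have:

\[ E[T] \ge \frac{B}{c_{I^*}} - \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \left( \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \ln\left(\frac{B}{c_{\min}}\right) + \frac{\pi^2}{3} + 1 \right) - 1 \]

where $E[T]$ is the expected number of pulls using KUBE.

That is, the difference between $\frac{B}{c_{I^*}}$ and the number of pulls of KUBE is at most
logarithmic in $\frac{B}{c_{\min}}$.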
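Proof of Lemma 6. Since KUBE pulls arms until none are feasible, by definition:

\[ P\left( \sum_{t=1}^{T} c_{i(t)} \ge B - c_{\min} \right) = 1. \]

Taking the expectation of $\sum_{t=1}^{T} c_{i(t)}$ over $T$ and $\{i(t)\}$ (i.e. the set of $i(t)$)
gives:

\[
\begin{aligned}
B - c_{\min} &\le E_{T,\{i(t)\}}\left[ \sum_{t=1}^{T} c_{i(t)} \right]
= E_T\left[ \sum_{t=1}^{T} E_{i(t)}\left[ c_{i(t)} \right] \right]
= E_T\left[ \sum_{t=1}^{T} \sum_{j=1}^{K} c_j P(i(t)=j \mid T) \right] \\
&= E_T\left[ \sum_{t=1}^{T} \sum_{j=1}^{K} (c_{I^*} + \delta_j) P(i(t)=j \mid T) \right] \\
&\le E_T[T]\, c_{I^*} + E_T\left[ \sum_{\delta_j > 0} \delta_j \left( \sum_{t=1}^{T} P(i(t)=j \mid T) \right) \right] \\
&\le E_T[T]\, c_{I^*} + E_T\left[ \sum_{\delta_j > 0} \delta_j \left( \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \ln(T) + \frac{\pi^2}{3} + 1 \right) \right] \qquad (29) \\
&\le E_T[T]\, c_{I^*} + \sum_{\delta_j > 0} \delta_j \left( \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \ln\left(\frac{B}{c_{\min}}\right) + \frac{\pi^2}{3} + 1 \right). \qquad (30)
\end{aligned}
\]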



Equation 29 is obtained from Lemma 3, while Equation 30 comes from the fact
that $T \le \frac{B}{c_{\min}}$ with probability 1. In addition, the inequality immediately before
Equation 29 is obtained from the fact that $\delta_j$ can be smaller than 0 for some $j$, and thus, we can further
upper bound by only summing up $\delta_j P(i(t)=j \mid T)$ over arms that have $\delta_j > 0$.
Now, by dividing both sides by $c_{I^*}$, we obtain:

\[ \frac{B}{c_{I^*}} - \frac{c_{\min}}{c_{I^*}} - \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \left( \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \ln\left(\frac{B}{c_{\min}}\right) + \frac{\pi^2}{3} + 1 \right) \le E_T[T]. \]

By using the fact that $\frac{c_{\min}}{c_{I^*}} \le 1$, we obtain the stated formula. Note that if
we relax the budget-limited MAB problem so that the number of pulls can be
fractional, then it is easy to show that the optimal pulling policy of this relaxed
model is to repeatedly pull arm $I^*$ only. In this case, $\frac{B}{c_{I^*}}$ is the number of
pulls of this optimal policy. Lemma 6 indicates that the number of pulls that
KUBE produces does not significantly differ from that of the optimal policy of
the fractional budget-limited MAB (i.e. the difference is a logarithmic function
of the number of pulls). We can now derive the regret bound of KUBE from
Lemma 6 as follows:

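Proof of Theorem 1. Recall that $E[G_B(A^*)]$ denotes the expected performance
of the theoretical optimal policy. It is obvious that $E[G_B(A^*)] \le \frac{B \mu_{I^*}}{c_{I^*}}$, since
the latter is the optimal solution of the fractional budget-limited MAB problem.
This indicates that:

\[
\begin{aligned}
R_B(\mathrm{KUBE}) &= E[G_B(A^*)] - E[G_B(\mathrm{KUBE})] \\
&\le \frac{B \mu_{I^*}}{c_{I^*}} - E_{T,\{i(t)\}}\left[ \sum_{t=1}^{T} \mu_{i(t)} \right] \\
&= \frac{B \mu_{I^*}}{c_{I^*}} - E_T\left[ \sum_{t=1}^{T} E_{i(t)}\left[ \mu_{i(t)} \right] \right] \\
&= \frac{B \mu_{I^*}}{c_{I^*}} - E_T\left[ \sum_{t=1}^{T} \sum_{j=1}^{K} \mu_j P(i(t)=j \mid T) \right] \\
&\le E_T\left[ \frac{B}{c_{I^*}} - T \right] \mu_{I^*} + E_T\left[ \sum_{t=1}^{T} \sum_{j=1}^{K} \Delta_j P(i(t)=j \mid T) \right] \\
&\le E_T\left[ \frac{B}{c_{I^*}} - T \right] \mu_{I^*} + E_T\left[ \sum_{t=1}^{T} \sum_{\Delta_j > 0} \Delta_j P(i(t)=j \mid T) \right] \\
&\le E_T\left[ \frac{B}{c_{I^*}} - T \right] \mu_{I^*} + E_T\left[ \sum_{\Delta_j > 0} \Delta_j E[N_j(T) \mid T] \right].
\end{aligned}
\tag{31}
\]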



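Note that since $\Delta_j$ can be smaller than 0 for some arm $j$, we can further upper
bound $R_B(\mathrm{KUBE})$ by only summing up $\Delta_j E[N_j(T) \mid T]$ over arms with $\Delta_j > 0$
(see the last two inequalities). Applying Lemma 6 to the first term and Lemma 3
to the second term on the right hand side of Equation 31 gives:

\[
\begin{aligned}
R_B(\mathrm{KUBE}) &\le \left[ \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \left( \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \ln\left(\frac{B}{c_{\min}}\right) + \frac{\pi^2}{3} + 1 \right) + 1 \right] \mu_{I^*} \\
&\quad + E_T\left[ \sum_{\Delta_j > 0} \Delta_j \left( \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \ln(T) + \frac{\pi^2}{3} + 1 \right) \right] \\
&\le \left( \frac{8}{d_{\min}^2} + \left(\frac{c_{\max}}{c_{\min}}\right)^2 \right) \left( \sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \right) \ln\left(\frac{B}{c_{\min}}\right) \\
&\quad + \left( \sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \right) \left( \frac{\pi^2}{3} + 1 \right) + 1,
\end{aligned}
\]

which concludes the proof. Note that the last inequality is obtained from the
facts that $\mu_{I^*} \le 1$ and $T \le \frac{B}{c_{\min}}$ with probability 1.

In a similar vein, we can show that the regret of fractional KUBE is bounded
as follows: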

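Theorem 7 (Main result 2) For any budget size $B > 0$, the performance
regret of fractional KUBE is at most

\[ \frac{8}{d_{\min}^2} \left( \sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \right) \ln\left(\frac{B}{c_{\min}}\right) + \left( \sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \right) \left( \frac{\pi^2}{3} + 1 \right) + 1. \]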



Proof of Theorem 7. We follow an approach similar to the proof of
Theorem 1. Given this, we only highlight the steps that are different from the
previous proofs. For the sake of simplicity, we use the notation previously
introduced for the performance analysis of KUBE. In particular, let $T$ denote
the random variable that represents the number of pulls that fractional KUBE
uses. Let $N_j(T)$ denote the number of times that the corresponding pulling
algorithm pulls arm $j$ up to time step $T$. Similar to Lemma 3, we first show
that within the fractional KUBE algorithm, we have:

\[ E[N_j(T) \mid T] \le \frac{8}{d_{\min}^2} \ln(T) + \frac{\pi^2}{3} + 1. \tag{32} \]

In so doing, note that

\[ E[N_j(T) \mid T] = 1 + \sum_{t=K+1}^{T} P(i(t)=j \mid T) \le l + \sum_{t=K+1}^{T} P(i(t)=j,\, N_j(t) \ge l \mid T) \tag{33} \]

for any $l \ge 1$. Now, using similar techniques from the proof of Lemma 3, we
can easily show that

\[ \sum_{t=K+1}^{T} P(i(t)=j,\, N_j(t) \ge l \mid T) \le \sum_{t=1}^{T} \sum_{s=1}^{t} \sum_{s_j=l}^{t} 2 t^{-4} \le \frac{\pi^2}{3} \]

for any $l \ge \frac{8 \ln T}{d_{\min}^2}$. By substituting this into Equation 33 we obtain Equation 32.
Next, we show that



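\[ E[T] \ge \frac{B}{c_{I^*}} - \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \left( \frac{8}{d_{\min}^2} \ln\left(\frac{B}{c_{\min}}\right) + \frac{\pi^2}{3} + 1 \right) - 1. \tag{34} \]

This can be derived from Equation 32 by using techniques similar to the proof
of Lemma 6. This implies that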



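\[
\begin{aligned}
R_B(\mathrm{KUBE}) &= E[G_B(A^*)] - E[G_B(\mathrm{KUBE})] \\
&\le \frac{B \mu_{I^*}}{c_{I^*}} - E_{T,\{i(t)\}}\left[ \sum_{t=1}^{T} \mu_{i(t)} \right] \\
&= \frac{B \mu_{I^*}}{c_{I^*}} - E_T\left[ \sum_{t=1}^{T} E_{i(t)}\left[ \mu_{i(t)} \right] \right] \\
&= \frac{B \mu_{I^*}}{c_{I^*}} - E_T\left[ \sum_{t=1}^{T} \sum_{j=1}^{K} \mu_j P(i(t)=j \mid T) \right] \\
&\le E_T\left[ \frac{B}{c_{I^*}} - T \right] \mu_{I^*} + E_T\left[ \sum_{t=1}^{T} \sum_{\Delta_j > 0} \Delta_j P(i(t)=j \mid T) \right] \\
&\le E_T\left[ \frac{B}{c_{I^*}} - T \right] \mu_{I^*} + E_T\left[ \sum_{\Delta_j > 0} \Delta_j E[N_j(T) \mid T] \right].
\end{aligned}
\]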



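By substituting Equations 32 and 34 into this, we obtain

\[ R_B(\mathrm{KUBE}) \le \frac{8}{d_{\min}^2} \left( \sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \right) \ln\left(\frac{B}{c_{\min}}\right) + \left( \sum_{\Delta_j > 0} \Delta_j + \sum_{\delta_j > 0} \frac{\delta_j}{c_{I^*}} \right) \left( \frac{\pi^2}{3} + 1 \right) + 1, \tag{35} \]

which concludes the proof.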

Having established a regret bound for the two algorithms, we now move on 
to show that they produce optimal behaviour, in terms of minimising the regret. 
In more detail, we state that: 

Theorem 8 (Main result 3) For any arm pulling algorithm, there exists a
constant $C > 0$, and a particular instance of the budget-limited MAB problem,
such that the regret of that algorithm within that particular problem is at least
$C \ln B$.

Proof of Theorem 8. By setting all of the arms' pulling costs equal to $c > 0$, any
standard MAB problem can be reduced to a budget-limited MAB. This implies
that the number of pulls within this MAB is guaranteed to be $\frac{B}{c} = T$ (i.e. $T$ is
deterministic). According to (Lai and Robbins, 1985), the best possible regret
that an arm pulling algorithm can achieve within the domain of standard MABs
is $C \ln(T)$. Therefore, if there were an algorithm within the domain of budget-limited
MABs that provided better regret than $C \ln\left(\frac{B}{c}\right) = C \ln T$, then it would also provide
better regret bounds for standard MABs, which contradicts this lower bound.

The results in Theorems 1 and 7 can be interpreted in the standard MAB
domain as follows. The standard MAB can be reduced to a budget-limited MAB
by setting all the pulling costs to be the same. Given this, $\frac{B}{c_{\min}} = T$ in any
sequence of pulls. This implies that both KUBE and fractional KUBE achieve
$O(\ln T)$ regret within the standard MAB domain, which is optimal (Auer et al.,
2002; Lai and Robbins, 1985).



Note that the regret bound of fractional KUBE is better (i.e. the constant
factor within the regret bound of fractional KUBE is smaller than that
of KUBE). However, this does not indicate that fractional KUBE has better
performance in practice. One possible reason is that these bounds are not tight.
In fact, as we will demonstrate in Section 5, KUBE typically outperforms its
fractional counterpart by up to 40%.
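To see how the two constant factors compare on a concrete instance, the sketch below evaluates both regret bounds. It is an illustration under stated assumptions rather than the authors' code: the function name `regret_bounds` is hypothetical, the formulas are the bounds of Theorems 1 and 7 as read from the proofs above, and the definitions $\Delta_j = \mu_{I^*} - \mu_j$, $\delta_j = c_j - c_{I^*}$ and $d_{\min} = \min_{j \neq I^*} (\mu_{I^*}/c_{I^*} - \mu_j/c_j)$ are assumed from the earlier sections of the paper.

```python
import math


def regret_bounds(mu, c, B):
    """Return (KUBE bound, fractional KUBE bound) for means mu, costs c, budget B."""
    dens = [m / k for m, k in zip(mu, c)]
    best = max(range(len(mu)), key=dens.__getitem__)      # arm I* (assumed unique)
    d_min = min(dens[best] - d for i, d in enumerate(dens) if i != best)
    if d_min <= 0:
        raise ValueError("the best-density arm must be unique")
    # Sum of positive Delta_j plus sum of positive delta_j / c_{I*}.
    gap_sum = sum(max(mu[best] - m, 0.0) for m in mu)
    gap_sum += sum(max(k - c[best], 0.0) for k in c) / c[best]
    log_term = math.log(B / min(c))
    tail = gap_sum * (math.pi ** 2 / 3 + 1) + 1
    frac = 8 / d_min ** 2 * gap_sum * log_term + tail
    kube = (8 / d_min ** 2 + (max(c) / min(c)) ** 2) * gap_sum * log_term + tail
    return kube, frac


if __name__ == "__main__":
    # Hypothetical instance: rewards in [0, 1], costs at least 1.
    print(regret_bounds([0.9, 0.5, 0.4], [1.0, 2.0, 4.0], 1000.0))
```

The two values differ only by the $(c_{\max}/c_{\min})^2$ term multiplying the logarithm, so the gap between the bounds grows with the spread of the pulling costs.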



5 Performance Evaluation 

In the previous section, we showed that the two algorithms provide asymptotically
optimal regret bounds, and that the theoretical regret bound of fractional
KUBE is tighter than that of KUBE. In addition, we also demonstrated that
fractional KUBE outperforms KUBE in terms of computational complexity.
However, it might be the case that these bounds are not tight, and thus, fractional
KUBE is less practical than KUBE in real-world applications, as is the
case with the standard MAB problem, where simple but not optimal methods
(e.g. ε-first, or ε-greedy) typically outperform more advanced, theoretically
optimal, algorithms (e.g. POKER (Vermorel and Mohri, 2005), or UCB). Given
this, we now evaluate the performance of both algorithms through extensive simulations,
in order to determine their efficiency in practice. We also compare the
performance of the proposed algorithms against that of different budget-limited
ε-first approaches. In particular, we show that both of our algorithms outperform
the budget-limited ε-first algorithms. In addition, we also demonstrate
that KUBE typically achieves lower regret than its fractional counterpart.

Now, note that if the pulling costs are homogeneous (that is, the pulling
costs of the arms do not significantly differ from each other), then the performance
of the density-ordered greedy algorithm does not significantly differ from
that of the fractional relaxation based algorithm (Kellerer et al., 2004). Indeed, since the
pulling costs are similar, it is easy to show that the density-ordered greedy approach
typically stops after one round, and thus, results in similar behaviour to
the fractional relaxation based method. On the other hand, if the pulling costs
are more diverse (i.e. the pulling costs of the arms differ from each other), then
the density-ordered greedy algorithm performs better
than the fractional relaxation based algorithm. Given this, in order to
compare the performance of KUBE and its fractional counterpart, we set three
test cases, namely: bandits with (i) homogeneous pulling costs; (ii) moderately
diverse pulling costs; and (iii) extremely diverse costs. In particular, within the
homogeneous case, the pulling costs are randomly and independently chosen
from the interval [5, 10]. In addition, the pulling costs are chosen from the interval
[1, 10] within the moderately diverse case, and from [1, 20] in the extremely
diverse case. The reward distribution of each arm $i$ is set to be a
truncated Gaussian, with mean $\mu_i$ randomly taken from the interval [10, 20], variance
$\sigma_i^2 = \frac{\mu_i}{2}$, and support $[0, 2\mu_i]$. In addition, we set the number of arms
$K$ to be 100.
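The test instances described above can be generated along the following lines. This is a sketch under assumptions, not the authors' experimental code: the helper names `make_instance` and `sample_reward` are hypothetical, costs and means are assumed to be drawn uniformly from their intervals, the variance is assumed to be $\mu_i/2$, and the truncation to $[0, 2\mu_i]$ is implemented by simple rejection sampling.

```python
import random

# Cost intervals for the three test cases described in the text.
COST_RANGES = {
    "homogeneous": (5.0, 10.0),
    "moderately diverse": (1.0, 10.0),
    "extremely diverse": (1.0, 20.0),
}


def make_instance(cost_range, K=100, seed=0):
    """Draw K pulling costs and K mean rewards (uniform draws are an assumption)."""
    rng = random.Random(seed)
    costs = [rng.uniform(*cost_range) for _ in range(K)]
    means = [rng.uniform(10.0, 20.0) for _ in range(K)]
    return costs, means


def sample_reward(mean, rng):
    """Sample a Gaussian reward truncated to [0, 2 * mean] by rejection."""
    sigma = (mean / 2.0) ** 0.5          # assumed variance: mean / 2
    while True:
        x = rng.gauss(mean, sigma)
        if 0.0 <= x <= 2.0 * mean:
            return x


if __name__ == "__main__":
    costs, means = make_instance(COST_RANGES["moderately diverse"])
    rng = random.Random(1)
    print(costs[0], means[0], sample_reward(means[0], rng))
```

With these parameters the truncation bounds lie several standard deviations from the mean, so the rejection loop almost always accepts the first draw.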

Our results are shown in Figure 1. These plots show the performance of
each algorithm divided by $\ln\frac{B}{c_{\min}}$, and the error bars represent the 95% confidence
intervals. By doing this, we can see that the performance regret of both
algorithms is $O\left(\ln\frac{B}{c_{\min}}\right)$, since in each test case, their regret, once divided
by $\ln\frac{B}{c_{\min}}$, converges to some constant factor $C$. From the
numerical results, we can see that both KUBE and fractional KUBE differ from
the best possible solution by small constant factors (i.e. $C$), since the limit of
their convergence is typically low (i.e. it varies between 4 and 7 in the test cases),
compared to the regret value of the algorithm. In addition, we can also see that
the fractional KUBE algorithm is typically outperformed by KUBE. The reason is
that the density-ordered greedy algorithm provides a better approximation than
the fractional relaxation based approach to the underlying unbounded knapsack
problem. This implies that KUBE converges to the optimal pulling policy faster
than its fractional counterpart. In particular, as expected, the performance of
the algorithms is similar in the homogeneous case, where the
density-ordered greedy method shows similar behaviour to the fractional relaxation
based approach. In contrast, KUBE clearly achieves better performance
(i.e. lower regret) within the diverse cases. Specifically, within the moderately
diverse case, KUBE outperforms its fractional counterpart by up to 40% (i.e.
the regret of KUBE is 40% lower than that of the fractional KUBE algorithm).
In addition, the performance improvement of KUBE is typically around 30% in
the extremely diverse case. This implies that, although the current theoretical
regret bounds are asymptotically optimal, they are not tight.
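The difference between the two knapsack approximations can be illustrated on a toy unbounded knapsack instance. The sketch below is a simplified reading, not the authors' implementation: the function names are hypothetical, and the fractional relaxation based method is assumed here to take only the highest-density item, rounded down to a whole number of copies. Integer costs are used so that the floor divisions are exact.

```python
def density_ordered_greedy(values, costs, budget):
    """Visit items by decreasing value/cost; take as many copies as still fit."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / costs[i], reverse=True)
    counts = [0] * len(values)
    for i in order:
        counts[i] = int(budget // costs[i])
        budget -= counts[i] * costs[i]      # residual budget for later items
    return counts


def fractional_relaxation_based(values, costs, budget):
    """Take only the highest-density item, rounded down (assumed reading)."""
    best = max(range(len(values)), key=lambda i: values[i] / costs[i])
    counts = [0] * len(values)
    counts[best] = int(budget // costs[best])
    return counts


def total_value(values, counts):
    """Total value of a multiset of items given as per-item counts."""
    return sum(v * n for v, n in zip(values, counts))


if __name__ == "__main__":
    values, costs, budget = [9, 2], [6, 2], 10
    for solve in (density_ordered_greedy, fractional_relaxation_based):
        counts = solve(values, costs, budget)
        print(solve.__name__, counts, total_value(values, counts))
```

When the costs are diverse, the greedy method can spend the residual budget on cheaper items, which is the effect the diverse test cases are designed to expose; with similar costs the residual budget is rarely large enough to fit another item.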

Apart from this, we can also observe that both of our algorithms outperform
the budget-limited ε-first approaches. In particular, KUBE and its fractional
counterpart typically achieve up to 70% and 50% less regret than
the budget-limited ε-first approaches, respectively. Note that the performance
of the proposed algorithms is typically under the line $O\left(B^{2/3} (\ln B)^{-1}\right)$, while
the budget-limited ε-first approaches achieve larger regrets. This implies that
our proposed algorithms are the first methods that achieve logarithmic regret
bounds.

6 Conclusions 

In this paper, we introduced two new algorithms, KUBE and fractional KUBE,
for the budget-limited MAB problem. These algorithms sample each arm in
an initial phase. Then, at each subsequent time step, they determine a best
set of arms, according to the agent's current reward estimates plus a confidence
interval based on the number of samples taken of each arm. In particular, KUBE
uses the density-ordered greedy algorithm to determine this best set of arms. In
contrast, fractional KUBE relies on the fractional relaxation based algorithm.
KUBE and its fractional counterpart then use this best set as a probability
distribution with which to randomly choose the next arm to pull. As such,
neither algorithm explicitly separates exploration from exploitation. We have
also provided an $O(\ln B)$ theoretical upper bound for the performance regret
of both algorithms, where $B$ is the budget limit. In addition, we proved that
the provided bounds are asymptotically optimal, that is, they differ from the
best possible regret by only a constant factor. Finally, through simulation, we
have demonstrated that KUBE typically outperforms its fractional counterpart
by up to 40%, although at an increased computational cost. In particular, the
average computational complexity of KUBE per time step is $O(K \ln K)$, while
this value is $O(K)$ for fractional KUBE.
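The step of using the best set as a probability distribution can be sketched as follows. This is an illustrative assumption about the mechanics rather than the authors' code: the function name `choose_next_arm` is hypothetical, and `counts[i]` is taken to be the number of copies of arm $i$ in the current best set, so that arm $i$ is drawn with probability proportional to that count.

```python
import random


def choose_next_arm(counts, rng):
    """Draw an arm index with probability counts[i] / sum(counts)."""
    if sum(counts) == 0:
        return None                     # empty best set: no arm is feasible
    return rng.choices(range(len(counts)), weights=counts, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical best set: one copy of arm 0 and two copies of arm 1.
    draws = [choose_next_arm([1, 2, 0], rng) for _ in range(10)]
    print(draws)
```

Arms that do not appear in the best set have weight zero and are never drawn.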

One of the implications of the numerical results is that although fractional
KUBE has a better bound on its performance regret than KUBE, the latter
typically outperforms the former in practice. Given this, our future work consists
of improving the results of Theorems 1 and 7 to determine whether tighter upper
bounds can be found. In addition, we aim to extend the budget-limited MAB
model to settings where the reward distributions are dynamically changing, as
is the case in a number of real-world problems. This, however, is not trivial,
since both of our algorithms rely on the assumption that the expected value of
the rewards is static, and thus, the estimates converge to their real value.

Figure 1: Performance regret of the algorithms, divided by $\ln\frac{B}{c_{\min}}$, for a 100-armed
bandit machine with homogeneous arms, moderately diverse arms, or extremely diverse
arms (left to right).